Exactly solvable statistical physics models for large neuronal populations

Maximum entropy methods provide a principled path connecting measurements of neural activity directly to statistical physics models, and this approach has been successful for populations of N~100 neurons. As N increases in new experiments, we enter an undersampled regime where we have to choose which observables should be constrained in the maximum entropy construction. The best choice is the one that provides the greatest reduction in entropy, defining a “minimax entropy” principle. This principle becomes tractable if we restrict attention to correlations among pairs of neurons that link together into a tree; we can find the best tree efficiently, and the underlying statistical physics models are exactly solved. We use this approach to analyze experiments on N~1500 neurons in the mouse hippocampus, and show that the resulting model captures the distribution of synchronous activity in the network.

It has long been hoped that neural networks in the brain could be described using concepts from statistical physics [1][2][3][4][5][6].More recently, our ability to explore the brain has been revolutionized by techniques that record the electrical activity from thousands of individual neurons, simultaneously [7][8][9][10][11][12][13]. Maximum entropy methods connect these data to theory, starting with measured properties of the network and arriving at models that are mathematically equivalent to statistical physics problems [14,15].In some cases these models provide successful, parameter-free predictions for many detailed features of the neural activity pattern [16,17].The same ideas have been used in contexts ranging from the evolution of protein families to ordering in flocks of birds [18][19][20][21][22][23][24].But as experiments probe systems with more and more degrees of freedom, the number of samples that we can collect typically does not increase in proportion.Here we present a strategy for building maximum entropy models in this undersampled regime, and apply this strategy to data from 1000+ neurons in the mouse hippocampus.
Consider a system of N variables x = {x i }, i = 1, 2, • • • , N .To describe the system, we would like to write down the probability distribution P (x) over these microscopic degrees of freedom, in the same way that we write the Boltzmann distribution for a system in equilibrium.In the example that we discuss below, all the x i are observed at the same moment in time, but we could also include time in the index i, so that P (x) becomes a probability distribution of trajectories.
From these N variables we can construct a set of K operators or observables {f ν (x)}, ν = 1, 2, • • • , K. The maximum entropy approach takes this limited number of observables seriously, and insists that expectation values for these quantities predicted by the model match the values measured in experiment, or more explicitly, where x (m) is the m th sample out of M samples in total.Notice that to have control over errors in the measurement of all K expectation values, we need to have K ≪ M N .There are infinitely many distributions that obey these matching conditions, but the idea of maximum entropy is that we should choose the one that has the least possible structure, or equivalently generates samples that are as random as possible while obeying the constraints in Eq. (1).From Shannon we know that "as random as possible" translates uniquely to finding the distribution that has the maximum entropy consistent with the constraints [25,26].The solution of this optimization problem has the form where the coupling constants λ ν must be chosen to satisfy Eq. ( 1).We emphasize that in using this approach, the arXiv:2310.10860v1[physics.bio-ph]16 Oct 2023 model in Eq. ( 3) is something that needs to be testedin systems such as networks of neurons, there is no Htheorem telling us that entropy will be maximized, nor is there a unique choice for the constraints that would correspond to the Hamiltonian of an equilibrium system.Without a Hamiltonian, how should we choose the observables f ν ?Probability distributions define a code for the data [25] in which each state x is mapped to a code word of length so the mean code length for the data is Combining Eqs. ( 1) and ( 3), ⟨ℓ⟩ exp is exactly the entropy of P .Thus, among all maximum entropy distributions, the one that gives the shortest description of the data is the one with minimum entropy.This "minimax entropy" principle was discussed 25 years ago [27], but has attracted relatively little attention.Every time we add a constraint, the maximum possible entropy is reduced [28], and this entropy reduction is an information gain.Thus the minimax entropy principle tells us to choose observables whose expectation values provide as much information as possible about the microscopic variables.The problem is that finding these maximally informative observables is generally intractable.
To make progress we restrict the class of observables that we consider.With populations of N ∼ 100 neurons, it can be very effective to constrain just the mean activity of each neuron and the correlations among pairs [14,16,17].But this corresponds to K ∝ N 2 constraints, and at large N we will violate the good sampling condition K ≪ N M .To restore good sampling, we try constraining only N c of the correlations, which define links between specific pairs of neurons that together form a graph G.If we describe the individual neurons as being either active or silent, so that x i ∈ {0, 1}, then the maximum entropy distribution is an Ising model on the graph G, where {h i , J ij } must be adjusted to match the measured expectation values ⟨x i ⟩ exp and ⟨x i x j ⟩ exp for pairs (ij) ∈ G.The minimax entropy principle tells us that we should find the graph G with a fixed number of links N c such that P G (x) has the smallest entropy while obeying these constraints.This remains intractable.Statistical physics problems are hard because of feedback loops.If we can eliminate these loops then we can find the partition function exactly, as in one-dimensional systems or on Bethe lattices [29].If the graph G has no (a) Mutual information between variables in a hypothetiacal system.(b) One can build the optimal tree one variable at a time, from an arbitrary starting variable, by iteratively adding the connection corresponding to the largest mutual information (dashed) from the tree (solid) to the remaining variables; this is Prim's algorithm.(c) Optimal tree that minimizes the entropy ST and maximizes the information IT .
loops then it describes a tree T , and among other simplifications we can write the entropy where S ind is the independent entropy of the individual variables, and I ij is the mutual information between x i and x j [Fig.1(a)].Because of the constraints on the maximum entropy distribution, I ij is the same whether we compute it from the model or from the data.Thus we can compute the entropy of the pairwise maximum entropy model on any tree without constructing the model itself.
Minimizing the entropy S T in Eq. ( 7) is equivalent to maximizing the total mutual information This defines a minimum spanning tree problem [15], which admits a number of efficient solutions.For example, one can grow the optimal tree by greedily attaching the new variable x i with the largest mutual information I ij to an existing variable x j [Figs.1(b) and 1(c)]; this is Prim's algorithm, which runs in O(N 2 ) time [30].By restricting observables to pairwise correlations that form a tree, we can solve the minimax entropy problem exactly, even at very large N .Further, we can give explicit expressions for the fields and couplings [15,31], where N i are the neighbors of i on the tree.The total number of constraints is K = 2N − 1, so if the number of independent samples M ≫ 2, then we are well sampled.As emphasized above, our approach yields a model that needs to be tested.At large N , pairwise correlations are a vanishingly small fraction of all possible correlations, and in building a tree we keep only a vanishingly small fraction of these.Does this literal backbone of correlation structure contain enough information to capture something about the behavior of the network as a whole?
We analyze data from an experiment on the mouse hippocampus [32].Mice are genetically engineered so that neurons express a protein whose fluorescence is modulated by calcium concentration, which in turn follows the electrical activity of the cell.Recording electrical activity is then a problem of imaging, which is done with a scanning two-photon microscope as the mouse runs in a virtual environment.The fluorescence signal from each cell consists of a relatively quiet background interrupted by short periods of activity, providing a natural way to discretize into active/silent (x i = 1/0) in each video frame [16].Images are collected at 30 Hz for 39 min, and the field of view includes N = 1485 neurons.This yields M ∼ 7 × 10 4 (non-independent) samples, sufficient to estimate the mutual information I ij with small errors.Among all N (N − 1)/2 ∼ 10 6 pairs of neurons, only 9% exhibit significant mutual information I ij [Fig.2(a)].We see that the distribution of mutual information is heavy-tailed, such that a small number of correlations contain orders of magnitude more information than average ( Ī = 2.9 × 10 −4 bits).Additionally, while most pairs of neurons are negatively correlated [Fig.2(b)], the strongest mutual information belong to pairs that are positively correlated [Fig.2(c)].Together, these observations suggest that a sparse network of positively correlated neurons may provide a large amount of information about the collective neural activity.
Applying our method, we identify the tree of maximally informative correlations [Fig.3(a)], which captures I T = 26.2bits of information; this is more than 50× the average we find on a random tree [Fig.3(b)].Although a tree includes only 2/N ≈ 0.1% of all pairwise correlations, we have I T ≈ 0.144S ind , so that the correlation structure we capture is as strong as freezing the states of 214 randomly selected neurons.Another way to assess the strength of the interactions is to see that on the optimal tree the mean activity of each neuron ⟨x i ⟩ deviates strongly from what is predicted by the field h i alone, This is shown in Fig. 3(d), where we compare the optimal tree with a random tree.So despite limited connectivity, the effective fields are very different from the intrinsic biases h i .
In the model of Eq. ( 6), positive (negative) J ij means that activity in neuron i leads to activity (silence) in neuron j.For random trees, the interactions J ij are split almost evenly between positive and negative [Fig.3(c)];  of the number of simultaneously active neurons K in the data (black), predicted by the maximally informative tree (red), and the Gaussian distribution for independent neurons with mean and variance ⟨K⟩exp (dahsed).To estimate probabilities P (K) and error bars (two standard deviations), we first split the experiment into 1-minute blocks to preserve dependencies between consecutive samples.We then randomly select one third of these blocks and repeat 100 times.For each subsample of the data, we compute the optimal tree T and predict P (K) using a Monte Carlo simulation of the model PT .
this distribution of interactions is consistent with previous investigations of systems of N ∼ 100 neurons where we can estimate and match all of the pairwise correlations [14,17,33].But since the largest mutual information are associated with positive correlations [Fig.2(c)], the maximally informative tree produces strong interactions that are almost exclusively positive and quite large [Fig.3(c)].We have arrived, perhaps surprisingly, at an Ising ferromagnet.
Is it possible that a backbone of ferromagnetic interactions captures some of the collective behavior in the network?One signature of this collective behavior is the probability P (K) that K out of the N neurons are simultaneously active within a window of time [14,17,24,33].For independent neurons, this distribution approximately Gaussian at large N (Fig. 4, dashed), but even in relatively small populations we see strong deviations from this prediction, with both extreme synchrony (large K) and near silence (small K) much more likely than expected from independent neurons [14]; this effect persists in the N ∼ 1500 neurons studied here (Fig. 4, black).The optimal tree captures most of this structure, correctly predicting ∼ 100× enhancements of the probability that K ∼ 50 or more neurons will be active in synchrony (Fig. 4, red).Although the detailed patterns of activity in the system are shaped by competing interactions that are missing from our ferromagnetic backbone, this shows that large-scale synchrony can emerge from a sparse network of the strongest positive correlations.
Thus far, we have focused on a single population of N ∼ 1500 neurons.But as we observe larger populations, how do the maximally informative correlations scale with N ?To address this question, we build populations of increasing size by starting from a single cell and drawing concentric circles of increasing radii (in the spirit of Ref. [17]), then repeating this process starting from each of the different neurons [Fig.5(a)].This construction exploits the fact that the neurons in this region of the hippocampus lie largely in single plane.As the population expands, the independent entropy S ind necessarily increases linearly with N on average.The entropy of any tree model S T , which is an upper bound on the true entropy, is reduced by the total information I T = S ind − S T .In Fig. 5(b) we see that the fractional reduction I T /S ind grows slowly with N for the optimal trees, while on random trees this fraction decays rapidly toward zero.
We can understand the decay of fractional information in random trees because the mutual information between two neurons declines, on average, with their spatial separation, although there are large fluctuations around this average.These fluctuations mean that a tree built by connecting nearest spatial neighbors will be better than random but still substantially suboptimal, as will be explored elsewhere.In contrast, as we consider larger populations we uncover more and more of the large mutual information seen in the tail of Fig. 2(a), and this allows I T /S ind to increase with N on the optimal tree [Fig.5(b)].There is no sign that this increase is saturating at N ∼ 10 3 , suggesting that our minimax entropy framework may become even more effective for larger populations.In summary, it has been appreciated for nearly two decades that the maximum entropy principle provides a link from data directly to statistical physics models, and this is useful in networks of neurons as well as other complex systems [14][15][16][17][18][19][20][21][22][23][24].Less widely emphasized is that we do not have "the" maximum entropy model, but rather a collection of possible models depending on what features of the system behavior we choose to constrain.Quite generally we should choose the features that are most informative, leading to the minimax entropy principle [27].As we study larger and larger populations of neurons we enter an undersampled regime in which selecting a limited number of maximally informative features is not only conceptually appealing but also a practical necessity.
The problem is that the minimax entropy principle is intractable in general.Here we have made progress in two steps.First, following previous successes, we focus on constraining the mean activity and pairwise correlations.Second, we take the lesson of the Bethe lattice and select only pairs that define a tree.Once we do this, the relevant statistical mechanics problem is solved exactly, and the optimal tree can be found in quadratic time [15,31].This means that there is a non-trivial family of statistical physics models for large neural populations that we can construct very efficiently, and it is worth asking whether these models can capture any of the essential collective behavior in real networks.We find that the optimal tree correctly predicts the distribution of synchronous activity (Fig. 4), and these models capture more of the correlation structure as we look to larger networks [Fig.5(b)].The key to this success is the heavy-tailed distribution of mutual information [Fig.2(a)], and this in turn may be grounded in the heavy-tailed distribution of physical connections [34].While these models cannot capture all aspects of collective behavior, these observations provide at least a starting point for simplified models of the much larger systems now becoming accessible to experiments.

FIG. 1 .
FIG.1.Computing the optimal tree of pairwise correlations.(a) Mutual information between variables in a hypothetiacal system.(b) One can build the optimal tree one variable at a time, from an arbitrary starting variable, by iteratively adding the connection corresponding to the largest mutual information (dashed) from the tree (solid) to the remaining variables; this is Prim's algorithm.(c) Optimal tree that minimizes the entropy ST and maximizes the information IT .

FIG. 2 .
FIG. 2. Mutual information in a large population of neurons.(a) Ranked order of all significant mutual information Iij in a population of N = 1485 neurons in the mouse hippocampus [32].Solid line and shaded region indicate estimates and errors (two standard deviations) of Iij.(b) Distribution of correlation coefficients over neuron pairs, with percentages indicating the proportion of positively and negatively correlated pairs.(c) Mutual information Iij versus correlation coefficient, where each point represents a distinct neuron pair.Estimates and errors are the same in (a).

FIG. 3 .
FIG.3.Minimax entropy models of a large neuronal population.(a) The optimal tree for the neurons in Fig.2, and (b) a random tree over the same neurons.In both trees, the central neuron has the largest number of connections, and those on the perimeter are leaves (having one connection) with distance from the central neuron decreasing in the clockwise direction.(c) Distributions of Ising interactions Jij [Eq.(9)].(d) Average activities ⟨xi⟩ versus local fields hi, where each point represents one neuron, and the dashed line illustrates the independent prediction [Eq.(11)].

FIG. 4 .
FIG.4.Predicting synchronized activity.Distribution P (K) of the number of simultaneously active neurons K in the data (black), predicted by the maximally informative tree (red), and the Gaussian distribution for independent neurons with mean and variance ⟨K⟩exp (dahsed).To estimate probabilities P (K) and error bars (two standard deviations), we first split the experiment into 1-minute blocks to preserve dependencies between consecutive samples.We then randomly select one third of these blocks and repeat 100 times.For each subsample of the data, we compute the optimal tree T and predict P (K) using a Monte Carlo simulation of the model PT .

FIG. 5 .
FIG. 5. Scaling with population size.(a) Illustration of our growth process superimposed on a fluorescence image of the N = 1485 neurons in the mouse hippocampus.Starting from a single neuron i, we grow the population of N neurons closest to i (red to yellow).We then repeat this process starting from each of the different neurons.(b) Fraction of the independent entropy IT /S ind explained by the optimal and random trees as a function of population size N .Lines and shaded regions represent median and interquartile range over different central neurons.